I am looking forward to the course for two reasons:
Another initial impression of the course is that the Moodle page is rather verbose and chaotic, which hampers finding the necessary information and the assigned tasks. I hope that I will get used to this and will not miss any compulsory assignment.
I found the course in Sisu because I still needed one credit for the Transferable Skills section of my degree.
I have found the book interesting and well written. Given my previous experience in R, I was only skim-reading it. Nevertheless, I found some useful functionalities that I had not used before, e.g.:
filter(year == 2020 | disease == "COVID19")

gapdata2007 %>%
  ggplot(aes(x = continent, y = lifeExp, fill = country)) +
  geom_col(position = "dodge") +
  theme(legend.position = "none")
the percent() function in the tidyverse (scales package), which is much simpler than sprintf("%2.1f%%", x); and with plotly and the HTML output of R Markdown one can create interactive HTML plots (e.g. one can hover over the diagram below to highlight countries):

plot = gapdata2007 %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
  geom_point()
ggplotly(plot)
Describe the work you have done this week and summarize your learning.
We received data consisting of questionnaire responses from individuals together with their results on some sort of test (exam). The questionnaire appears to concern study habits. The origin of the data cannot be clearly identified from the metadata provided.
# raw data
df = as_tibble(read.csv('http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt', sep = '\t'))
# processed data created in the wrangling step
learning2014 = read_csv('data/learning2014.csv') %>%
  rename(points = Points, attitude = Attitude)
The original raw data consists of 183 records and 60 variables. After summarizing some questions into groups (such as questions referring to in-depth studying into a "deep" group) by taking their mean, and removing observations with 0 exam points, the processed data consists of 166 observations and 7 variables. Information on the data structure can be found below.
str(learning2014)
## spec_tbl_df [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ Age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. Age = col_double(),
## .. Attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. Points = col_double()
## .. )
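The grouping-and-filtering step described above can be sketched roughly as follows. This is only an illustration: the question names in `deep_questions` are a hypothetical subset, and the actual question-to-group assignments come from the course's data wrangling instructions.

```r
library(dplyr)

# Hypothetical subset of the "deep learning" questions; the full
# question lists are given in the course instructions.
deep_questions <- c("D03", "D06", "D07", "D11")

learning2014 <- df %>%
  # summarize a question group by its row-wise mean
  mutate(deep = rowMeans(across(all_of(deep_questions)))) %>%
  # drop observations with zero exam points
  filter(Points > 0)
```

The same pattern (a `rowMeans(across(...))` per group) would be repeated for the "stra" and "surf" groups.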
Let’s have a look at the distribution of the variables in this data by plotting them pairwise in a scatter plot matrix. The figure below provides detailed insight into the data. We suspect that there are differences between the male and female participants, hence we stratify the data into these two groups by color.
colMap = c("F" = "#FF5555", "M" = "#1100EE")
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014,
        mapping = aes(color = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)),
        upper = list(continuous = wrap("cor"))) +
  scale_color_manual(values = colMap) +
  scale_fill_manual(values = colMap)
We can read several pieces of information from the plot. First, there were almost 1.5 times more females (red) in the study than males (blue), as can be seen from the gender bar chart in the top row. Based on the boxplots in the top row, the males' attitude towards statistics was higher; however, there was no large difference between the exam points of males and females. Nevertheless, the positive correlation between attitude and points is observable for both genders.
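The gender-stratified correlations shown in the upper triangle of the plot matrix can also be reproduced directly; a quick check, assuming the `learning2014` data frame above:

```r
library(dplyr)

# correlation of attitude and exam points within each gender group
learning2014 %>%
  group_by(gender) %>%
  summarise(cor_attitude_points = cor(attitude, points))
```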
In a linear regression analysis we are searching for a linear relationship between predictor (feature) variables and an expected target outcome. Let’s pick the three variables that are the most likely candidates based on the previous plot:
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The results show that attitude is a statistically significant predictor of the exam points, as the p-value for the slope of this variable is very low (\(1.93 \times 10^{-8}\)). The F-statistic of the test also indicates that at least one of the selected variables has a non-zero slope.
The t-test for each variable has the null hypothesis that the slope of that variable in the regression equation is 0. If that is true, there is no association between the variable and the target. If the p-value is low, the result is unlikely under the null hypothesis, so we usually reject it, i.e. we believe that there is an association.
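The p-values quoted in the summary above can also be pulled out of the fitted model object programmatically, which is handy when filtering predictors in code rather than by eye:

```r
# coefficient table of the fitted model:
# estimate, std. error, t value and p-value per term
coefs <- summary(my_model)$coefficients
coefs[, "Pr(>|t|)"]                   # p-value of each t-test
coefs["attitude", "Pr(>|t|)"] < 0.05  # TRUE here, so we reject the null hypothesis
```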
So we fit the model again with attitude only (as that was the only variable showing significance):
my_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The summary of this model shows that there is a positive association (with a slope of about 3.5) between the attitude score and exam points. In lay terms this means that a 1 unit increase in attitude corresponds to roughly a 3.5 point increase in the exam results, on top of a baseline (intercept) of 11.6. The multiple R-squared value gives the proportion of variation that is explained by the variables in the model. In this case it is 0.1906, so a little less than 20% of the variation is explained by the explanatory variable, meaning that there are other important factors affecting the exam results that we are omitting here.
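As a sanity check on this interpretation, the fitted line and the R-squared can be recomputed by hand from the model object (using the fact that in simple regression R-squared equals the squared correlation):

```r
b <- coef(my_model)  # (Intercept) ~11.64, attitude ~3.53
# predicted exam points for a student with attitude 3.8
unname(b["(Intercept)"] + b["attitude"] * 3.8)
# R-squared as the squared correlation between attitude and points
cor(learning2014$attitude, learning2014$points)^2
```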
par(mfrow = c(2,2))
# diagnostic plots: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage
plot(my_model, which = c(1,2,5))
The linear regression model assumes that the observations are produced by a linear function of the explanatory variable plus additive, normally distributed noise with zero mean. The residuals estimate this noise, hence it is good to check how normal they are, and whether they are independent of the fitted values (otherwise the noise would depend on the observation). The first two plots refer to this: the first shows that the residuals are evenly distributed over the fitted values, and the Q-Q plot shows that the quantiles of the residual distribution are very close to those of a normal distribution. The last plot, “Residuals vs. Leverage”, indicates which observations may have the greatest effect on the estimated parameters (linear regression is known to be sensitive to outliers), hence it could be useful to check whether the model improves a lot by removing those observations.
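One way to follow up on the leverage plot, sketched here as an extra check rather than part of the original analysis, is to refit the model without the most influential observations and compare the fit:

```r
# refit without the three most influential observations (by Cook's distance)
cooks <- cooks.distance(my_model)
worst <- order(cooks, decreasing = TRUE)[1:3]
refit <- lm(points ~ attitude, data = learning2014[-worst, ])
summary(refit)$r.squared  # compare with the original 0.1906
```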
Beyond explaining the data, linear regression models can also be used to predict unobserved results, such as below.
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)
# Print out the new data
new_data
## attitude
## Mia 3.8
## Mike 4.4
## Riikka 2.2
## Pekka 2.9
# Predict the new students' exam points based on attitude
predict(my_model, newdata = new_data)
## Mia Mike Riikka Pekka
## 25.03390 27.14918 19.39317 21.86099
(more chapters to be added similarly as we proceed with the course!)